Lecture 10: Databases in the Wild

This Cat Earns $150 Per Hour Working at Home as the World’s Grumpiest Data Scientist…

[http://worldofwonder.net/grumpy-cat-lands-broadway-role-else-cats/]

Introduction to Social Media

Almost As Big As It Gets

Data Are Everywhere

Privacy On Social Media is a Myth

What Social Media Companies Want

Understand the Business to Understand the Designs

How We’ll Examine Social Media

  1. Understand the kind of data that they produce.

  2. Investigating the activities, behaviors, and economic value of social media users.

  3. Deriving some general insights about the customers and goals for the business.

  4. Evaluating the design of these sites – and their inherent means of collecting data – in the context of fulfilling these goals.

Projects with a Real Social Media Company

About the Site

Technologically Sophisticated

A Security Protocol

Accessing the data required a number of security measures:

A Brief Introduction to Databases

Most databases are built in some variation of Structured Query Language (SQL). In an SQL database, there is no such thing as a standalone variable. Every piece of information is stored in a table, which is akin to a data.table object in R.
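As a small illustration (the toy table is hypothetical), here is a SQL-style aggregation alongside its data.table counterpart:

```r
library(data.table)

# A toy users table. The equivalent SQL query would be:
#   SELECT State, COUNT(*) AS N FROM toy_users GROUP BY State;
toy.users <- data.table(id = 1:4, State = c("NY", "CA", "NY", "TX"))

# The analogous data.table operation, counting rows per State:
counts <- toy.users[, .N, keyby = "State"]
counts
# counts: CA 1, NY 2, TX 1
```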

A Relational Database

Example: Social Media Collections

The Users Table

This is a randomly generated collection of fictional social media users.

library(data.table)
library(DT)
users <- fread(input = "Users.csv")
datatable(data = users[1:100, ], rownames = FALSE)

Ambiguities Everywhere

The Pictures Table

This is a table recording all of these users’ pictures.

pics <- fread(input = "Pictures.csv")
pics[, .N]
[1] 1724000
datatable(data = pics[1:100, ], rownames = FALSE)

Fields of the Pictures Table

The Emails Table

emails <- fread(input = "Emails.csv")
emails[, .N]
[1] 1280905
datatable(data = emails[1:100, ], rownames = FALSE)

Fields of the Emails Table

Notice that a deleted email still exists in the company’s database; it is simply no longer displayed in the user’s messages.

Privacy Considerations

The Time Frames

min.and.max <- function(x, na.rm = TRUE) {
    require(data.table)
    return(data.table(Min = min(x, na.rm = na.rm), Max = max(x, 
        na.rm = na.rm)))
}
pic.date.name <- "Date_Uploaded"
pics[, min.and.max(get(pic.date.name))]
                    Min                  Max
1: 2017-09-01T04:00:00Z 2017-10-02T03:59:59Z
email.date.name <- "Date_Sent"
emails[, min.and.max(get(email.date.name))]
                    Min                  Max
1: 2017-09-01T04:00:02Z 2017-10-04T01:50:38Z

All of the pictures and messages were added to the site in a time frame encompassing September and early October of 2017.

Linkages: Pics and Emails to Users

owner.name <- "Owner"
user.id.name <- "id"
pics[, mean(get(owner.name) %in% users[, get(user.id.name)])]
[1] 1
sender.name <- "Sender"
receiver.name <- "Receiver"
emails[, mean(get(sender.name) %in% users[, get(user.id.name)] & 
    get(receiver.name) %in% users[, get(user.id.name)])]
[1] 1

All of the pictures were owned by those in the users table, and all of the emails were between pairs in the users table.

Linkages: Users to Pics and Emails

users[, mean(get(user.id.name) %in% pics[, get(owner.name)])]
[1] 1
users[, mean(get(user.id.name) %in% emails[, get(sender.name)])]
[1] 0.9998
users[, mean(get(user.id.name) %in% emails[, get(receiver.name)])]
[1] 1

Every user in the cohort uploaded at least one picture and received at least one email. A few users did not send any emails among the messages recorded here.
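These linkage checks rely on the mean(x %in% y) idiom: %in% returns a logical vector of matches, and its mean is the fraction of matches. On toy vectors (hypothetical ids, for illustration):

```r
ids <- c(101, 102, 103, 104, 105)
senders <- c(102, 103, 105)

# The fraction of ids that appear among the senders:
mean(ids %in% senders)
# [1] 0.6
```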

This May or May Not Be Complete Information

Getting the Data

How Much SQL, and How Much R?

There is no magic formula for deciding on how to extract the data and analyze it. Your choices will depend on:

My Approach to Data Extraction

This may or may not be the best approach, but it was one that provided a reasonable bridge between the structure of the database and my analyses in R.

Creating a Secure Connection

ssh -L 8080:the_complicated_cloud_domain_name 
my_laptop@vpn2.the_company_website.com -p 32722

When you run this kind of command, the server receives information about your machine’s name, hardware address, and IP address. The network’s administrator has to register your account in advance. If your information matches an account in the registry, then the secure connection is established. Had I tried to log in from a cafe, the IP address would not have matched, and my access would have been denied.

Entering Passwords in R

You can set up your program to allow a user to enter a password to proceed:

library(getPass)
password <- getPass(msg = "Enter the Password, please!")

Connecting to the Database

require("RPostgreSQL")
drv <- dbDriver(drvName = "PostgreSQL")

library(getPass)
username <- getPass(msg = "Enter the Username, please!")
password <- getPass(msg = "Enter the Password, please!")

conn <- dbConnect(drv = drv, dbname = "the_social_media_database", 
    host = "localhost", port = 8080, user = username, password = password)
rm(list = c("username", "password"))

# Note: on.exit() only works inside a function. At the top level,
# disconnect manually when finished: dbDisconnect(conn = conn)

This was connecting to a PostgreSQL database. Of course, there are many other varieties that may require a different library, driver, and connection protocol.

In practice, this is something that the organization’s engineering team should help you with as you get started. Each setting is likely to have its own protocols.

Running SQL Through R

An Extraction Function

extract.data.from.AWS <- function(query, n = NA, user = NA, 
    pw = NA, returnTime = FALSE, dbname = "dbname=the_database sslmode=require", 
    host = "localhost", port = 8090) {
    require(RPostgreSQL)
    require(data.table)
    drv <- dbDriver(drvName = "PostgreSQL")
    on.exit(dbDisconnect(conn = con))
    
    if (is.na(user)) {
        require(getPass)
        user <- getPass(msg = "Enter the username, please!")
    }
    if (is.na(pw)) {
        require(getPass)
        pw <- getPass(msg = "Enter the Password, please!")
    }
    con <- dbConnect(drv = drv, dbname = dbname, host = host, 
        port = port, user = user, password = pw)
    rm(pw)
    start.time <- Sys.time()
    if (!is.na(n)) {
        query <- sprintf("%s LIMIT %d", query, n)
    }
    dat <- dbGetQuery(conn = con, statement = query)
    end.time <- Sys.time()
    
    if (returnTime == TRUE) {
        print(sprintf("Extracting %d items required %.0f seconds", 
            nrow(dat), difftime(time1 = end.time, time2 = start.time, 
                units = "secs")))
    }
    dat <- setDT(dat)
    return(dat)
}

High Level Overview of the Extraction

We will not be directly using this extraction function or working with SQL in our class, but I have found it to be quite useful working with online databases. By generalizing the connection process, I can use this function to make any query. Then the bulk of my time can be spent focusing on the real problems – how to get the right table – instead of how to properly connect to the database.
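For instance, the function’s optional n argument works by appending a LIMIT clause to whatever query string you supply. In isolation (the query shown is hypothetical):

```r
query <- "SELECT id, Age, State FROM users"
n <- 100

# The same logic used inside the extraction function:
limited.query <- sprintf("%s LIMIT %d", query, n)
limited.query
# [1] "SELECT id, Age, State FROM users LIMIT 100"
```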

Merging Information from Multiple Tables

Different Types of Merges

Option 2: Combining Tables in R

Any type of SQL Join can be performed in R using the merge command:

users_plus_pictures <- merge(x = users, y = pics, by.x = user.id.name, 
    by.y = owner.name, all.x = TRUE, all.y = TRUE)
datatable(data = users_plus_pictures[1:5], rownames = FALSE)

Merging Options

The combinations of all.x and all.y correspond to the different types of SQL Join operations.

Note: When x and y are data.table objects, the merge function dispatches to the merge.data.table method, which performs the operation more efficiently.

Matching Join Types to Merging Types
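On toy tables (hypothetical data, for illustration), the four combinations of all.x and all.y map to the standard joins:

```r
library(data.table)
x <- data.table(id = 1:3, a = c("p", "q", "r"))
y <- data.table(id = 2:4, b = c("u", "v", "w"))

inner <- merge(x = x, y = y, by = "id")                # INNER JOIN: ids 2, 3
left  <- merge(x = x, y = y, by = "id", all.x = TRUE)  # LEFT JOIN:  ids 1, 2, 3
right <- merge(x = x, y = y, by = "id", all.y = TRUE)  # RIGHT JOIN: ids 2, 3, 4
full  <- merge(x = x, y = y, by = "id", all = TRUE)    # FULL JOIN:  ids 1, 2, 3, 4
```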

Multiple ID Columns

There is another problem with the result of the merge we performed. The resulting table had two different id columns.

Unambiguous ID Columns

pic.id.name <- "pics_id"
setnames(x = pics, old = "id", new = pic.id.name)
age.name <- "Age"
state.name <- "State"

users_plus_pictures <- merge(x = users[, .SD, .SDcols = c(user.id.name, 
    age.name, state.name)], y = pics, by.x = user.id.name, 
    by.y = owner.name, all.x = TRUE, all.y = TRUE)
datatable(data = users_plus_pictures[1:5], rownames = FALSE)

Databases: Fun with Counting

Counting Records

pic.counts.by.user <- users_plus_pictures[, .N, keyby = user.id.name]
pic.counts.by.user[, summary(N)]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  121.0   163.0   172.0   172.4   181.0   238.0 

Daily Uploads

pic.date.name <- "Date_Uploaded"
calendar.date.name <- "Calendar Date"
users_plus_pictures[, eval(calendar.date.name) := as.Date(get(pic.date.name))]
the.counts <- users_plus_pictures[get(age.name) < 35 & get(state.name) == "CA", .N, keyby = calendar.date.name]

Plotting Daily Uploads

plot(as.formula(sprintf("N ~ `%s`", calendar.date.name)), 
    data = the.counts, type = "l", xlab = "Date", ylab = "Number of Pictures Uploaded", 
    ylim = c(0, 1.3 * max(the.counts[, N])), cex = 0.8, 
    las = 1, main = "Daily Uploaded Pics in CA under 35")

And What Happened on October 2nd?

pics[, max(get(pic.date.name))]
[1] "2017-10-02T03:59:59Z"
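One likely explanation: the timestamps are stored in UTC. If the site’s users are concentrated in the US Eastern time zone (an assumption for illustration), converting the maximum upload time shows that the data actually end at the last second of October 1, local time:

```r
stamp <- "2017-10-02T03:59:59Z"

# Parse the ISO 8601 timestamp as UTC.
utc.time <- as.POSIXct(x = stamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Display it in US Eastern time.
format(utc.time, tz = "America/New_York", usetz = TRUE)
# [1] "2017-10-01 23:59:59 EDT"
```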

Daily Active Users

daily.active.users <- users_plus_pictures[get(calendar.date.name) < 
    max(get(calendar.date.name)), .(`Daily Active Users` = length(unique(get(user.id.name)))), 
    keyby = calendar.date.name]

Plotting Daily Active Users

plot(as.formula(sprintf("`Daily Active Users` ~ `%s`", calendar.date.name)), 
    data = daily.active.users, type = "l", xlab = "Date", 
    ylab = "Daily Active Users", cex = 0.8, las = 1, main = "DAU: Uploaded Pictures")

Counting Unique Values

length.unique <- function(dat, value.name, by = NA) {
    require(data.table)
    this.dat <- setDT(x = dat)
    if (length(by) > 0) {
        if (!is.na(by[1])) {
            if (by[1] == "") {
                by[1] <- NA
            }
        }
    }
    if (is.na(by[1])) {
        dat.u <- unique(this.dat, by = value.name)
        ans <- dat.u[, .(N = .N)]
    }
    if (!is.na(by[1])) {
        dat.u <- unique(this.dat, by = c(by, value.name))
        ans <- dat.u[, .N, keyby = by]
    }
    return(ans)
}
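At its core, length.unique applies a two-step data.table idiom: deduplicate on the grouping and value columns, then count the rows per group. On toy data (hypothetical, for illustration):

```r
library(data.table)
dat <- data.table(user = c(1, 1, 2, 3), day = c("d1", "d1", "d1", "d2"))

# Step 1: keep one row per (day, user) combination.
dat.u <- unique(dat, by = c("day", "user"))

# Step 2: count the unique users within each day.
dat.u[, .N, keyby = "day"]
# d1 has 2 unique users; d2 has 1.
```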

Daily Active Users by Age Group

age.group.name <- "Age Group"
library(Hmisc)
users_plus_pictures[, eval(age.group.name) := cut2(x = get(age.name), cuts = c(18, 25, 35, 45, 65, 120))]
dau.by.state <- length.unique(dat = users_plus_pictures, value.name = user.id.name, by = c(state.name, age.group.name, calendar.date.name))
datatable(data = dau.by.state[get(state.name) == "NY",][1:100,], rownames = FALSE)

Counting Sent Emails

How many emails did the typical user send? Count the number for each user. Show a summary of the distribution. Keep in mind that some users may have sent zero emails (and would not appear as senders in the Emails table).

Summarizing the Counts of Sent Emails

email.count <- function(emails, by, id.name){
  # Count the emails for each unique value of the 'by' column.
  email.count.by.sender <- emails[, .(Num.Emails = .N), by = by]
  
  # Note: this relies on the users table from the calling environment.
  the.emails.count <- merge(x = users[, .(id)], 
      y = email.count.by.sender[, .(id = get(by), Num.Emails)], 
      by = id.name, all.x = TRUE, all.y = FALSE)

  # Users with no matching emails get a count of zero.
  the.emails.count[is.na(Num.Emails), Num.Emails := 0]
  return(the.emails.count)
}
email.id.name <- "id"
sent.emails <- email.count(emails = emails, by = sender.name, id.name = email.id.name)
summary(sent.emails[, Num.Emails])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    87.0   127.0   128.1   168.0   289.0 

Counting Received Emails

How many emails did the typical user receive? Count the number for each user. Show a summary of the distribution. Keep in mind that some users may have received zero emails (and would not appear as receivers in the Emails table).

Summarizing the Counts of Received Emails

received.emails <- email.count(emails = emails, by = receiver.name, 
    id.name = email.id.name)
summary(received.emails[, Num.Emails])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   45.0   102.0   127.0   128.1   153.0   244.0 

Assessing Influence

Most people send about as many e-mails as they receive. One measure of a person’s influence is how much attention they receive. Those who receive many more messages than they send are in demand. They are potentially influential. For each user, what is the difference between the number of messages received and the number sent? Show a summary of this difference as a measure of the user’s influence.

Summarizing Influence

users.plus.sent <- merge(x = users[, .SD, .SDcols = user.id.name], 
    y = sent.emails[, .(id, Num.Emails.Sent = Num.Emails)], 
    by = user.id.name)
users.email.counts <- merge(x = users.plus.sent, y = received.emails[, 
    .(id, Num.Emails.Received = Num.Emails)], by = user.id.name)
users.email.counts[, Received.Minus.Sent := Num.Emails.Received - Num.Emails.Sent]
summary(users.email.counts[, Received.Minus.Sent])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    -92     -18       0       0      19      83 

Do Influential Email Recipients Post More Pictures?

What is the correlation between receiving more emails than you send and posting more pictures?

Here is a plan to answer this question:

Merging the Counts and Received.Minus.Sent

setnames(x = pic.counts.by.user, old = "N", new = "Num.Pics.Uploaded")
dat <- merge(x = pic.counts.by.user, y = users.email.counts, 
    by = user.id.name, all = TRUE)

dat[, cor(x = Num.Pics.Uploaded, y = Received.Minus.Sent, 
    use = "pairwise.complete.obs")]
[1] 0.00309792

In this case, simply uploading a lot of pictures does not mean that you will start to receive more messages than you send.

Problems with Relational Databases on the Cloud

  1. New data and records are constantly arriving.

  2. The information you want to connect may not be directly related. You may have to perform multiple JOIN operations to get what you’re looking for.

  3. Every dynamic extraction requires downloading the results, which can be time consuming.

We will discuss all of these problems in greater detail.

Issue #1: The Influx of New Data

Design Choices

In short, a database always provides a record of the current state of the data in that moment. However, you may or may not have access to a full historical record of the changes that took place.

An Example: Early Usage Analyses

Reproducibility in Databases

Issue #2: Indirectly Connected Information

Example: Indirectly Linked Tables

The site has grown its community by advertising on other sites. How many new users have come from each site? It seems like a simple question.

Linking Organizations to Users

Perhaps this is not such a simple question after all.

The Referrals Table

referrals <- fread(input = "Referrals.csv")
datatable(data = referrals[1:100], rownames = FALSE)

The Campaigns Table

campaigns <- fread(input = "Campaigns.csv")
datatable(data = campaigns[1:100, ])

The Organizations Table

organizations <- fread(input = "Organizations.csv")
datatable(data = organizations, rownames = FALSE)

Connecting Campaigns to Organizations

campaigns_plus_orgs_all <- merge(x = campaigns, y = organizations, 
    by.x = "Organization_id", by.y = "id", all.x = TRUE, 
    all.y = FALSE)
campaigns_plus_orgs <- campaigns_plus_orgs_all[, .(Campaign_id, 
    organization = Name)]
datatable(data = campaigns_plus_orgs, rownames = FALSE)

Connecting Referrals to Organizations

referrals_plus_orgs_all <- merge(x = referrals, y = campaigns_plus_orgs, 
    by = "Campaign_id")
referrals_plus_orgs <- referrals_plus_orgs_all[, .(User_id, 
    organization)]
datatable(data = referrals_plus_orgs)

Connecting Users to Organizations

users_plus_orgs <- merge(x = users, y = referrals_plus_orgs, 
    by.x = "id", by.y = "User_id")
datatable(data = users_plus_orgs, rownames = FALSE)

An Indirect Path

Option 1: Nested Merges

We could do all of the merges in one very complicated step:

users_plus_orgs_2 <- merge(x = users, y = merge(x = referrals, 
    y = merge(x = campaigns, y = organizations, by.x = "Organization_id", 
        by.y = "id", all.x = TRUE, all.y = FALSE)[, .(Campaign_id, 
        organization = Name)], by = "Campaign_id")[, .(User_id, 
    organization)], by.x = "id", by.y = "User_id")
setnames(x = users_plus_orgs_2, old = "id", new = "User_id")

Option 2: multi.merge

multi.merge <- function(objects, by = NULL, by.x = NULL, 
    by.y = NULL, all = FALSE, all.x = all, all.y = all, 
    sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, 
    allow.cartesian = getOption("datatable.allow.cartesian"), 
    ...) {
    require(data.table)
    num.objects <- length(objects)
    if (!is.null(by)) {
        by.x <- by
        by.y <- by
    }
    if (length(by.x) == 1) {
        by.x <- rep.int(x = by.x, times = num.objects - 
            1)
    }
    if (length(by.y) == 1) {
        by.y <- rep.int(x = by.y, times = num.objects - 
            1)
    }
    if (length(all.x) == 1) {
        all.x <- rep.int(x = all.x, times = num.objects - 
            1)
    }
    if (length(all.y) == 1) {
        all.y <- rep.int(x = all.y, times = num.objects - 
            1)
    }
    res <- setDT(objects[[1]])
    for (i in 2:num.objects) {
        res <- merge(x = res, y = setDT(objects[[i]]), by.x = by.x[i - 
            1], by.y = by.y[i - 1], all.x = all.x[i - 1], 
            all.y = all.y[i - 1], sort = sort, suffixes = suffixes, 
            allow.cartesian = allow.cartesian, ...)
    }
    return(res)
}

multi.merge Example

users_plus_orgs_3 <- multi.merge(objects = list(campaigns, 
    organizations[, .(id, organization = Name)], referrals, 
    users), by.x = c("Organization_id", "Campaign_id", "User_id"), 
    by.y = c("id", "Campaign_id", "id"), all.x = TRUE, all.y = FALSE)[, 
    .SD, .SDcols = names(users_plus_orgs_2)]

mean(users_plus_orgs == users_plus_orgs_2)
[1] 1
mean(users_plus_orgs_2 == users_plus_orgs_3)
[1] 1

While not necessarily any faster, using multi.merge can significantly cut down on the work you have to do to structure nested merges.

Analytics on Referring Organizations

barplot.output <- users_plus_orgs[, barplot(table(organization), 
    las = 2, cex.names = 0.4, ylim = c(0, 600), main = "Referrals by Organization")]

Characteristics of Users Referred by an Organization

users_plus_orgs[get(state.name) == "CA" & organization == 
    "Organization_Name_1", .(Mean_Age_CA = round(x = mean(x = get(age.name), 
    na.rm = T), digits = 1))]
   Mean_Age_CA
1:        43.9

Answers Are Possible But Difficult to Generate

A Disorienting Feeling

Issue #3: Pitfalls of Cloud Computing

It is up to you to find a way to make your working environment practical and efficient. We’ll discuss a few options for solving these problems.

Option #1: Periodic Downloads

Advantages to Periodic Downloads

Disadvantages to Periodic Downloads

In one example, I was developing some investigations while the data engineer was creating new variables to display in the tables. From time to time, the engineer would reorganize the data – changing the names of variables or moving information from one table to another. This would often happen without a process of informing me of the changes. Had I stuck with periodic downloads, I would have been developing my work on an obsolete structure.

Option #2: Working on the Cloud

Advantages of Working on the Cloud

Disadvantages of Working on the Cloud

Option #3: Prototyping with Small Samples

Making Do

Option #4: Hire a Data Engineer

Hey, we’ve said all along that data engineering is anything that a data scientist doesn’t want to do and can persuade a data engineer to do instead. If the databases are too challenging, it might impact the ability of your team to perform the intended analyses. Contributions from a data engineer who could handle some of these technical issues might free up the data scientists to derive insights for the organization. It’s OK to ask for help, and it might be better for the project overall.

Periodic Updating Wins Out – When Possible

But What About Real Time Systems?

A Plan for Social Media Analytics

Pre-Processing

Saving the Image

  1. Separate files for each table.
  2. A combined .RData file for the entire list of tables.
save(list = ls(), file = sprintf("Social_Media_Data_%s.RData", 
    Sys.Date()))

Using list = ls() saves all of the variables (equivalent to the save.image() function), but you can specify a smaller set of variables to include instead.
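For instance (using a temporary file and toy objects for illustration), you could save just the three core tables:

```r
users <- data.frame(id = 1:2)
pics <- data.frame(id = 1:3)
emails <- data.frame(id = 1:4)

# Save only the named objects rather than the entire workspace.
out.file <- file.path(tempdir(), sprintf("Social_Media_Data_%s.RData", Sys.Date()))
save(list = c("users", "pics", "emails"), file = out.file)

# Loading into a fresh environment recovers exactly those tables.
snapshot <- new.env()
load(file = out.file, envir = snapshot)
ls(envir = snapshot)
# [1] "emails" "pics"   "users"
```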

Maintaining Security

An End Run Around Databases

Now We are Set Up for Investigations

And On and On

We will continue exploring these themes in the next lecture.